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Abstract 

Gravitational lensing allows us to probe the structure of matter on a broad range 
of astronomical scales, and as light from a distant source traverses an intervening 
galaxy, compact matter such as planets, stars, and black holes act as individual 
lenses. The magnification from such microlensing results in rapid brightness 
fluctuations which reveal not only the properties of the lensing masses, but also 
the surface brightness distribution in the source. However, while the combina- 
tion of deflections due to individual stars is linear, the resulting magnifications 
are highly non-linear, leading to significant computational challenges which cur- 
rently limit the range of problems which can be tackled. This paper presents 
a new and novel implementation of a numerical approach to gravitational mi- 
crolensing, increasing the scale of the problems that can be tackled by more than 
two orders of magnitude, opening up a new regime of astrophysically interesting 
problems. 
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1. Introduction 



Since the identification of the first multiply- imaged quasar (jWalsh et al 



1979f ). gravitational lensing has been used to p robe the dist r ibutio n of matter 
on many ast rophysical scales, such as planets I Gaudi et al. . 20081 ), individual 



galax i es (e.g. iBrewer fc Lewi; 



ESool 



Dye et al 



20081) . cl usters (e.g. lDeb et al 



20081 ISand et all 120081 ). and large-scale structure (e.g. iKitching et all I200S 



Wang et al. L l2009h . Soon after the discovery of the first gravitational lens, it 



Email addresses: hgarsdenaphysics.usyd.edu.au (Hugh Garsden), 
gfl@physics.usyd.edu.au (Geraint F. Lewis) 

1 Research undertaken as part of the Commonwealth Cosmology Initiative (CCI: 
www.thecci.org), an international collaboration supported by the Australian Research Coun- 
cil. 



Preprint submitted to New Astronomy 



July 1, 2009 



was realis ed that individual s tars w ithin a lensing galaxy would themselves act 
as lenses ( Chang fe; Refsdal , 19791 ). and while the separation of the resulting 
images would be too small to be resolved, this microlensing can significantly 
magnify the background source. Furthermore, given the density of compact 
objects within galaxies (such as stars and possible dark matter), it was clear 
that many objects will influence the path of a beam of light as it traverses a 
galaxy, and beyond considering only a couple of lensing masses, the study of 
microlensing becomes analytica lly intractable and must be tackled numerically 
( Yountd . Il98tt iPaczvnskil 1 19861 ). 

T he first observatio ns of microlensing were made in 1989 ( Corrigan et all 

1990Ulrwin et al.l . ll989l ). in 2237+0305 . a distant quasar (z = 1.695) lensed into 
four images by a nearby spiral galaxy ( Huchra et a l.. 1985); it has been photo- 
metrically monitored for two decades ( Udalski et all 120061 ) , revealing exquisite 
light curves for each of the images, clearly revealing the presence of gravitational 
microlensing. Furt hermore, the number of potentially microlensed quasars has 



steadily increased jE igcnbro d et al 



Reimers et al.. 2002; Turner et al 



, 2006; Gu nn et al. 
19891 : IVanderriest et al.l 



19791; iMvers et al.l . ll999l; 



1983b and hence 



efficient means of theoretically understanding gravitational microlensing is re- 
quired. 



2. Microlensing Ray- Tracing 

2.1. Mathematical Framework 

This paper does not intend to derive the basic gravitation al lensing equa- 
tions, the reader is d irected towards the excellent textbook by ISchneider et al 



(1992) and thesis by IWambsgansd ( 1990h . For microlensing calculations a nor- 



malized lens equation in the following form is used to map the ray from the 
observer, through the lens, and into the source: 

Here, x and y are the location of a light ray in the lens and source plane 
respectively, x\ and rrii are the mass and co-ordinates of the ith microlensing 
star. For the remaining terms in this expression, a c represents the local density 
of smooth matter in the galaxy, whereas 7, known as the shear, encapsulates 
the large scale asymmetry in the galaxy mass distribution. The surface density 
of compact objects (assumed to be uniform), is also used in modelling and 
simulations and is given by o~ s . The distance scale used in lensing is the Einstein 
Radius (ER) , the radius of a ring produced by lensing of a source that is directly 
behind a point lens in a line from lens to observer. Einstein Radii are used to 
refer to the extent of observed images on the sky, and are derived from the mass 
of the lens and the distances to the lens and source. 
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Figure 1: An example of a gravitational microlensing magnification map generated via inverse 
ray tracing. The lens consists of 435 stars each of 1 solar mass (Mq), randomly distributed, 
with mass parameters cr s = 0.39, 7 = 0.10. The viewing window is 8 ER in width. Light 
shading corresponds to regions of high magnification, whereas darker regions represent de- 
magnification. 



2.2. Numerical Approach 

Clearly, Eqn [I] represents a mapping from a source location, y, to a number 
of image (or observed) positions x. However, it is apparent that this mapping 
is one-to-many, with a single source resulting in a number of images (dependent 
upon the number of microlensing masses AT*). However, reversing the mapping 
from the image plane ba ck to the sourc e is one-to-one and hence an "inverse 
ray-tracing" mechanism ( Glassner . 1989h is typically employed in the study of 
microlensing. With this, the observer sends out a large number of rays into 
the image plane. For each ray, the deflection angle is calculated through EqnQ] 
and the ray is mapped into the source plane and collected on a grid. After 
all the rays are followed, the density of rays in the binned-up map of their 
source positions is directly proportional to the magnification of a source at 
that location; Fig [T] presents an example of such a map, where light regions 
correspond to a high density of collected rays (i.e. strong magnification) whereas 
dark regions correspond to the opposite situation. 

A deeper examination of Eqn [T] reveals that the computationally intensive 
aspect of undertaking such ray tracing is the sum over the microlensing masses 
when calculating the deflection angles; in typical simulations, there could be 
10 s — > 10 6 masses, with the tracing of ~ 10 11 rays required to achieve suffi- 
cient density in the s ource plane. Following the seminal work by (jKavser et al 



19861. IWambsgansd (fl999) implemented an inverse ray-tracing approach which 
has become the "industry standard" , utilizing a tree-algorithm to ease the sum 
over the microlensing deflection angles. The advent of this approa ch has allowed 



the analysis of the magnification patterns of microlensed q uasars dWambseanss , 
19921) . the structure of quasar broad line emission regions (jKeeton et all I' 



2006 
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(a) 100% = 1 Mq, number of stars = 30496 (b) 98% = 0.01 M , number of stars = 

2989306 




(c) 98% = 0.001 Mq , number of stars = (d) 98% = Smooth, number of stars = 609 
29887581 



Figure 2: Magnification maps from simulations of bi-modal mass distributions. A total mass 
of 30496 Mq is distributed in a circular lens plane of diameter 500 ER, with a viewing window 
of 20 X 20 ER. All simulations use 7 = 0.4250, with (a), (b) and (c) having a s = 0.4750, cr c 
= and (d) having a s = 0.0095, cr c = 0.4655, so the total mass is the same in all. The first 
simulation (a) is a single-mass distribution of stellar masses, then in (b) and (c) some of the 
stellar mass is replaced with smaller compact objects, then in (d) that mass is replaced with 
smooth matter. 
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Lewis fe Ibata , 2004 ) , chromatic effects in microiensi ng dWambsganss fe Paczvnski , 
1991 ) , the na t ure o f dark matter in lensing galaxies ([Lewis fe Gil-Merinol 120061 ; 



Poolev et all 120081 ; Schcchtcr & Wambsganss . I2002T). and the effect of source 



size on microiensing ( Bate et al. . 2008 ; iMortonson et a l.. 2005), among others 



3. New Physical Challenges 



S e veral quasar systems appear to poss e ss anomalous flux rat ios (Black burne et al 
2006t lEigenbrod etail l200a lOta et all l2006t iPoolev et all l2006h 



meaning 

that the observed image brightnesses differ significantly from predictions drawn 
from gravitational lens models possessing mass distributions that are smooth 
on galactic scales. Two key hypotheses have been put forward to explain 
these observations; either these anomalous ratios are due to m illilensing by 
10 6 M(7) clumps of dark matter i n the halo of the lensing galax y ( Chibal . 12002c 



ig g 

Dalai fc Kochanekl . 120021 lMetcalj[200ll ; lMao fc Schneiderl . ll998] ). or the quasars 
are microlensed by star s embedded in an overall smooth dark matter distribu- 
tion ( Witt et al. . 1995f) . The former proposal is attractive as it may provide 



the first direct measurement of the missing substructure expected from galactic 
build-up in cold dark matter formation scenarios. The latter proposal is also 
important, potentially probing the fundamental nature of dark matter, as will 
be explained below. 

The original proposal bv lWitt et al. (1995) considered a smooth dark matter 
component which actually suppresses the image flu x for long periods, leading 
to the apparently anomalous flux ratios. However, ISchechter fc Wambsganssl 
(2002) questioned this hypothesis, suggesting that the dark matter could in fact 
be composed of substellar compact objects, and that the existence of such a com- 
pact component could imprint itself on the resulting gravitational lens statistics. 
In studying the mic roiensing hypothesis, a number o f of numerical simulations 
were undertaken bv lSchechter fc Wambsgarissl ( 2002 ) using the backwards ray 
shooting approach, and Figure [5] illustrates several examples of the parameters 
employed. Figure [2^a) simulates a galaxy of a few thousand stars all of one 
solar mass (M Q ), with a s — 0.475 and 7 = 0.425. Figures [^b) and (c) use the 
same parameters but two different masses for a s : 2% of the mass is in stars of 
mass M©, and the rest is contained in small compact objects, 0.01M Q in (b) 
and O.OOIMq objects in (c). The final panel, Figure [^d), presents the case 
of a smooth dark matter component (a s — 0.0095, a c = 0.4655, 7 = 0.425); 
as shown in ( Lewis fc Gil-Merinol . 20061 ). when the mass of the compact dark 
matter component is made smaller, the general scale of caustic structure is re- 
duced, although large scale caustic features, due to the presence of the solar 
mass stars in the simulations, remain. In comparing the lower two panels in 
Figure [H the large scale caustic structure is the same in the small compact 
and smooth dark matter cases, but the smaller scale stru cture is quite differ- 
ent (in fact, it is non-existent in the smooth matter case). iLewis fc Gil-Merino 
(2006) went on to show that, even though the large scale caustic structure is 
similar, the presence of the small scale caustics imprints itself on the statistical 
properties of the microiensing, with the lower panels possessing quite different 
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magnification probability distributions. However, convolving these maps with a 
finite source radius washes out the small scale caustic structure, and for large 
enough sources the convolved compact mass and smooth mass magnification 
maps become statistically indistinguishable. This implies that there is a source 
scale size below which any compact matter component would appear as com- 
pact, and above which it would appear smooth, and hence observations of such 
systems at differing wavelengths (corresponding to differing source sizes) would 
probe the fundamental properties of dark matter. 

However, the simulations undertaken to date have been limited by the com- 
putational power and memory of available computers; anomalous flux quasars 
are those that typically have been strongly magnified, requiring ray shooting 
through a large area to be statistically meaningful. Furthermore, these lim- 
itations place constraints on the number of dark matter lenses that can be 
considered, effectively placing a limit on the mass scale that can be simulated - 
around 10 7 objects of mass 0.001 M©. Hence, it was decided that a supercom- 
puter implementation of the ray-shooting approach was required to tackle these 
physically interesting regimes. 



4. A Supercomputing Approach 

The numerical ray-shooting approach employs several approximations that 
have little impact on the accuracy but allow a significant improvement in perfor- 
mance. An examination of Equation [1] reveals that ar" 1 calculation is required 
for each lensing mass, the total deflection requiring a sum over all of these indi- 
vidual masses; of course, the calculation of the distance, r, involves the square 
roots, significa ntly adding to the c omputational load. The ray-tracing approach 
as adopted by IWambsgans reduces this load by use of a multipole ex- 



pansion of the gravitational interaction for lensing masses far from the passage 
of the ray through the field of masses. For this, a cell tree is constructed by 
subdividing the distribution of lensing masses, with a constant regular subdi- 
vision until each cell holds a maximum of a single lensing mass. An opening 
angle criteria is then applied to distinguish between employing nearby masses 
as individual lenses, whereas more distant lens masses are grouped and a mul- 
tipole expansion of their distribution is used to approximate their gravitational 
influences. 

The use of the cell-tree substantially improves simulation run-time but sub- 
stantially increases the amount of data required to run a simulation. All data 
is held in computer memory, and consists of: 

• A 2-D array for the pixel map. 

• An array of "stars", each element containing the mass and location of a 
star in the galaxy. We will use the term "stars" to refer to all objects 
constituting the lens. 

• An array for the cell tree, each element containing information about a 
cell and its location within the tree. 
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A typical simulation on a desktop computer may contain a pixel array of 
2000x2000 pixels and of order a million stars. This corresponds to 15 Mb for 
the pixel array, 38 Mb for the stars array, and 89Mb for the cell tree. The rays 
have no data existence; the result of shooting a ray is a pixel hit which is added 
to that pixel's hit count. In the following, we outline the basic operation of 
the standard microlensing approach and the changes required to employ it in a 
supercomputer environment, successfully handling the significant computational 
and storage requirements to push the application of microlensing into a new 
scientific regime (see Section [5]). 

The starting point is to understa nd the operation of the standard ray shoot- 
ing approach of I Wambsganssl ( 1999h : this is expressed in the following pseudo- 
code 

/ / Setup phase 

Generate the stars, using a random distribution of locations and 
masses, within certain parameters 

Sort the stars by encoding indexes 

Generate the cell tree from the stars 

Sort the cell tree by encoding indexes 

/ / Shooting phase 

Loop 

Shoot a ray 

Increment the pixel hit array where necessary 
Move to a new location in the lens plane 

Until the whole lens plane has been covered 

The sorting operations of the stars and cell tree are required because during the 
shooting of the rays, searches are performed to determine the break between 
the employment of individual masses and expanded cells that are to be used for 
a particular ray, with the configuration changing as we consider rays traversing 
differing locations in the field of stars. 

As noted in Section [31 we wish to increase the number of stars under consid- 
eration. Immediately, this significantly impacts the undertaking of microlensing 
simulations by desktop computers (which are limited by virtual memory to ~ 20 
million stars), increasing the storage required for the masses and locations, as 
well as the tree structure. Furthermore, we additionally require an increase of 
the number of rays fired; as can be seen in Figure^ the inclusion of small masses 
induces small scale structure in the magnification map, and to fully resolve this 
it is essential to increase the number of pixels covering the source, and the num- 
ber of rays traced, to ensure we are not limited by counting statistics in the 
final map. In moving to a supercomputer environment, the following changes 
are required: 



Adapt the algorithmic approach of IWambseansl (|l999h . 
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• Parallelize the approach so it can exploit current multiprocessor comput- 
ers; ray tracing is inherently parallclizable, as the rays are usually indepen- 
dent of each other. There are typically billions of rays in a microlensing 
simulation, which can be distributed in some appropriate way across many 
processors and shot concurrently. 

• Increase the number of stars that can be considered, into the billions. This 
is an increase of a hundred-fold, and it is unlikely the data can be held in 
RAM with any reasonable architecture. Other strategies will be required 
to store the massive amounts of data that will be used. 



Chang e to an object-oriented paradigm; the approach of Wambsganssl 



(1999) was originally implemented in Fortran 77 with procedural code and 
global data. It was decided that the supercomputer implementation would 
employ C++ as some objects such as the stars and cells naturally lend 
themselves to implementation as CH — h class objects. Re-implementing in 
CH — h will provide a sound basis for future work. 

As the reimplementation in CH — h is a design and development process we will 
describe it only briefly here. Data elements, such as stars and cells, were iden- 
tified and reimplemented as C++ class types. The Fortran lensing algorithms 
were rewritten in C++ and combined with the C++ classes, producing a work- 
ing program. The arrays that store the stars and cells were replaced with a 
C++ module that implements arrays as disk files (described later), but with 
a similar interface. Then the implementation was parallelized by sharing out 
the generation of stars, and the ray-shooting, to different processors (this is 
described later). 

The rest of the paper focuses on the strategies used to parallelize our ap- 
proach and increase the size of lenses available for microlensing research. 

4-0.1. Parallelizing the ray shooting 

For the supercomputer implementation, the rays that are shot need to be 
distributed across t he available processors . There are many w ays of parallelizing 
ray-tracing ( Amri . 20031 : Lee et al. . 1998; IVerdu et al.l . 1996) but little experi- 



ence with this in gravitational lensing simulations. It does not matter which ray 
is executed on which processor, or in what order, as they are completely inde- 
pendent. We consider only static assignment of rays to processors, as dynamic 
assignment for load-balancing purposes is much more complicated and static 
assignment should be tried first. Dynamic load balancing is a model where 
job tasks (in this case rays) are assigned to processors at run-time based on 
the load on each processor as the simulation proceeds - idle processors get the 
next jobs, much like a multiprocessor operating system running user programs. 
Much more time and effort is required to implement dynamic load balancing 
compared to static load balancing. Static load balancing means that when the 
program runs, some algorithm at startup determines which ray is assigned to 
which process, and that is fixed for the duration of the simulation. 
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Therefore we are looking for a mapping 

Assign(Ray) i— > Processor (2) 

which maps rays to processors for that run. There is a value for Assign which 
is optimal for a particular simulation, but it is unknown, so some heuristic is 
required. Mapping an equal number of rays to each processor is sufficient if 
there are many more rays than processors and each ray takes the same time. 
The latter is probably not going to be the case since the stars in the lens may 
not be not distributed evenly, and rays close to a lot of mass will take longer 
to process. We have adopted an interlaced ray distribution model: take a 2- 
D element dA of the lens plane and distribute the rays in that element across 
all processors, and do the same for all other elements. This means that every 
processor simulates lensing over the entire lens plane, but only some rays at 
each point of the lens plane. We hope that this will even out the load without 
the need for load balancing. 

To implement parallelization we could employ either a multithreading or 
multiprocess model. Multiprocessing means that several parallel processes are 
created by the computer operating system and they must access shared data 
outside the program, for example on disk. In multithreading there is only a 
single process with parallel threads of execution running inside it, and all global 
data in the program is automatically shared between threads. Multithreading 
is usually more efficient but also more error-prone in development, so for a first 
implementation we opted for multiprocessing. 

In the setup phase the generation of the stars can be parallelized by dis- 
tributing the load evenly across all processes. The generation of the cells from 
the stars is a sequential process based on sophisticated location and tree index- 
ing which cannot be easily parallelized. The sorting steps could be parallelized 
but have not been; this will be discussed in the next section. The shooting 
phase can be distributed across all processors. 

4-1. Increasing the data storage 

Pushing into the regime of ^billion stars in a lens requires a stars array and 
cell tree of order 200GB; the following are possible ways of storing this data: 

1. On a 64-bit computer with 200Gb RAM it could be stored in memory as 
it previously has been. 

2. If simulations are to be run on many computers at the same time, and 
the computers collectively have enough memory, distribute the data in the 
memory of those machines. 

3. Put the data in files on a shared disk, not in RAM. 

Option 1 is possible and requires no modification of the data storage, although 
clearly requires a particular computer architecture not available in typical (i.e. 
beowulf ) clusters. Option 2 is possible if enough machines are used with enough 
RAM; then data could be spread amongst them and fetched across the network. 
Option 3 can be used anywhere there is sufficient disk space, even on desktop 
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computers. In our new implementation all three options were combined, as huge 
amounts of disk space are now common (more than large RAM machines) , the 
simulation data is placed on disk, but partially cached in memory so that all 
the available RAM is used. 

The stars array and cell tree were each replaced with a binary file, but still 
accessed like arrays, so that any object at any part of the file could be requested 
with an index. Naturally this means that file I/O will dramatically slow down 
the simulation, although it is envisaged that caching strategies and parallelizing 
the ray-shooting may compensate for this. Since shared files are trivial on 
modern networks, multiple processes can read data from a single set of shared 
files. When starting a simulation, several processes can create the files and the 
others wait until they are ready, or, each process can generate its own version 
of the files, which is very wasteful of disk space but means some processes may 
be able to start earlier. 



4-2. Sorting with files 

The sorting of stars and cells is inherent in the algorithms in the set-up phase 
that code the stars and cells for indexing and searching; it cannot be avoided 
without significant redesign of the cell tree usage. In t he original implementation 
where data is in RAM, a heap-sort algorithm is used ( KnuthlTl998l ). A heap sort 



swaps elements from the end of an array to the beginning, gradually building a 
sorted array in place of the original values. When the stars and cell tree were 
placed in files the sorting steps became extremely slow - as expected. A run 
of one billion stars would take an intolerably long time and be unworkable. In 
fact, when the data was initially placed in files and a simulation of one billion 
attempted, the sort was still running after two days and we gave up on it. 
The best way to p r ocess files is sequentially, and best way to sort files is with 



a merge sort (jKnuthl . Il998| ) . We implemented a folding merge-sort algorithm 
for the sorting steps. In the original implementation the stars and cell tree were 
sorted after they were generated; a folding merge-sort does the sorting while the 
data is generated. The algorithm is as follows; as stars are generated they are 
placed in a memory cache. When the cache is full, it is sorted using a heap-sort 
and written to the stars file. This is the first instance of the file. Further stars 
are generated, and placed in the cache. When the cache is full it is heap-sorted 
and then merge-sorted into the file, producing a new instance of the file. The 
process continues until all stars are in the file, and the file is always sorted. The 
same process is done for the cell tree. 

This mechanism substantially improved the run-time of the sorting phases. 
There are however disadvantages: 

• The space required for a merge-sort is double the data size, due to the 
fact that a new version of the file is being created as the existing one is 
being read 

• As the file gets bigger the merge gets slower 

Nevertheless this method has provided us with a usable implementation. 
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4-3. Shooting cache 

The memory cache used for merge-sorting is only required during the setup 
phase; after that the memory is available for shooting. As rays are shot, a per- 
ray stars and cell configuration is used to calculate the deflection, and analysis 
of the usage patterns in the sequential version showed that around 66% of stars 
and cells used for one ray are used again for the next, as the rays are close 
together in space (assuming a large number of them). 

Based on this information we implemented a shooting memory cache that 
allows re-use of stars and cells, from one ray to the next. Unfortunately this 
scheme interacts with the interlaced ray distribution scheme, in that the scheme 
means each process is executing rays that are slightly further apart on the lens 
plane than they normally would be. This degrades the performance of the 
shooting cache slightly. 

4-4- File cache 

The amount of memory needed for the stars/cell configuration for a ray is 
not much (a few Mb), leaving plenty of RAM free. This RAM could be used as 
a more general memory cache for storing some stars and cells that are otherwise 
on disk, making them faster to access. Implementing a general cache of file 
data requires an analysis of the usage patterns of the stars and cell data files, 
and at a first-order approximation there are two patterns operating: depth first 
traversals of the cell tree and binary searches in both the stars and cell arrays. 
A depth first traversal will follow a single branch of the cell tree from the root 
down to a "leaf, traversing the file in a sequential fashion; a binary search will 
jump around from one end of the file to the other. These conflict, and make it 
difficult to generate heuristics for anticipating what portions of the files to load 
into memory, so we have chosen a simple scheme to begin with: cache the first 
levels of the cell tree and the first records of the stars file, as many as possible. 
The root of the cell tree is heavily accessed and caching it proved useful. 

5. Results 

We discuss the performance of our new method and then give some prelim- 
inary results of its application to microlensing. 

5. 1 . Performance 

The following performance statistics were obtained from simulations using 
10 7 stars with a pixel map of 2000x2000. The number of rays to shoot are based 
on these and other parameters, in this case it is 1.3xl0 9 . The performance 
depends greatly on the amount of RAM available for caching. For the test runs 
reported here, we fixed the size of the cache per parallel process to be about 
l/8th the size of the stars data, because this is the ratio for a simulation of 10 9 
stars run on a common 32-bit desktop machine, something we may consider as 
a "standard" for comparisons. 
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Version 


Total 


Shooting 


Setup 






time 


time 


time 


Fortran 


vl 


5.64 


5.57 


0.06 


Fortran, optimized 


v2 


1.98 


1.93 


0.05 


CH — |-j Fortran copy 


v3 


5.38 


5.34 


0.04 


CH — h, copy, optimized 


v4 


2.15 


2.12 


0.03 


CH — \-, data in files, 


v5 


20.93 


19.72 


1.21 


no caches 










CH — h, add merge-sorts 


v6 


4.20 


4.01 


0.18 


and file caches 











Table 1: Run times (hours) for different versions of the new approach, single process 



Table [T] shows run-times for different versions of the new tool, beginning 
with the original Fortran (vl) and progressing through some C++ versions to 
the final version (v6). All times are in hours. The first C++ version that was 
generated (v3, v4) was a straight copy of the Fortran (vl, v2). The run-times 
for both languages are similar. Of note is the fact that both run 2.5-2.8 times 
faster when full compiler optimizations are used, due to the algorithms being 
computationally intensive and these versions having no I/O operations. The 
CH — h is slightly computationally faster when unoptimized, the Fortran slightly 
faster when optimized, but they are close, so we have not lost much in the move 
to C++. The same compilers (GNU C++ and GNU Fortran 95 from FSF0) 
were used for both languages, but since there are overheads in C++ due to the 
use of object-oriented classes, it is not surprising that C++ is slightly slower 
in the best-case scenario. We have not investigated the relative performance 
when both are unoptimized. Different compilers with different optimization 
strategies can always produce a faster (or slower) program, so our comparisons 
do not constitute a formal benchmarch, merely the performance for commonly- 
used compilers (i.e GNU). Further investigation can be made in the future to 
determine the compiler that is best suited for this program. 

After taking the arrays out of memory and placing them on disk the per- 
formance, as expected, degrades considerably (v5). The run is 10 times slower. 
Although the stars and cells are generated in a sequential fashion, they are then 
heap-sorted, which is very inefficient inside files. Using the folding merge-sort 
(v6) avoids inefficiencies because it scans and merges file sections sequentially. 
In v5 when shooting begins there is no file caching; star data and cell data 
is loaded from all over the files, and the I/O and spread-out file access makes 
shooting very slow. In v6 the shooting cache and file cache are implemented 
and they improve performance time by reducing the access to the star and cell 
data files. The techniques used to improve performance in the files version (v6) 
bring the run-time down so that is is comparable to the unoptimized in-memory 



1 Pree Software Foundation, Inc. 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA 
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version - a good result. 

5.1.1. Parallel speed-up 

Versions 4 and 6 are able to run in parallel mode, dividing work between 
processes. Version 4 maintains data in memory and is can be used when the 
number of stars is up to 10 7 ; version 6 uses files and must be used for simulations 
of more than 10 7 stars. 

Figure[3]shows the speedup obtained for these versions using the new method. 
Version 4 (solid line) uses parallelism during ray-shooting. It achieves a perfect 
speedup during ray shooting, showing that all the theoretical parallelism that 
is available can be extracted in practice by the right implementation and in the 
right circumstances: i.e. each ray is completely independent of the others, there 
are far many more rays than processes, our simple ray distribution scheme is 
achieving a good load balance, and the data is in memory and can be accessed 
concurrently. 

Version 6 uses parallelism in two places: during the generation of stars in 
the setup phase, and during ray shooting. These are indicated by the dashed 
and dotted lines respectively. The stars generation shows a reasonable speedup 
but then tapers off, the shooting phase only gains a speedup of 3. The reason 
for the poor speedups in this file based implementation is serialization of I/O 
requests to the data files. To stop this happening it would be necessary to 
give each process a copy of the files, placed on different disks, and accessed 
through different I/O controllers. We note again that the performance of the 
new approach is now much more dependant on the computer on which it is 
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run. Currently we do not have a different configuration to test our simulations 
on, but a 2-3 times speedup is acceptable enough, and in a real situation this 
particular simulation would use enough RAM so that all the data was cached. 

5.1.2. Summary 

To investigate the properties of quasar microlensing we enhanced a well- 
known microlensing implementation by massively increasing the number of stars 
that can be used, and executing some phases of the algorithms in parallel. The 
simulation is run on supercomputers, data is placed in very large files, but 
parallelising the algorithms and adding various caching schemes and a different 
sort strategy has brought the run-time to acceptable levels. We can now run 
simulations in a time comparable to the original method, but using 10 9 stars, 
100 times more than in the past. 

5.2. Microlensing results 

Using our new implementation we can push down the mass size of the com- 
pact masses in bi-modal simulations, while at the same time increasing their 
number. Figure 2] shows simulations of quasar microlensing models that go 
down to 0.000025 M Q for small compact mass objects; this is about 10 times 
Earth mass. Future simulations will go to smaller masses and possibly up to 
3 billion stars. Of most interest is what happens to the magnification map 
as the masses become smaller. Figure |4] begins at case (c) of figure [2] and re- 
duces the mass to 0.000025 Mq, increasing the number of stars to over a billion. 
With these small masses the large-scale caustics are becoming clearer and the 
fine-structure in-between the caustics is becoming smoother. Initial analysis 
indicates that as the small masses become smaller and more numerous the mag- 
nification distribution will tend to that of the smooth-matter case (d), but the 
size of the source will be important. Results of more complete investigations 
will be reported in forthcoming papers. 

6. Discussion 

Our new approach is of interest in itself, and can be studied as an instance 
of a large, parallel, computationally intensive implementation. Although our 
current aims excluded modifying the physics code, that can still be done at 
some later time. 

6.1. Software Engineering 

The following could be worthwhile scientific research projects based around 
our implementation: 

Optimum distribution of rays to processors. Do rays that are close to 
clumps of matter really take longer to process than those that aren't? And by 
how much? How close to optimal is our simple static load-distribution scheme? 

Dynamic load balancing. Instead of fixing the allocation of rays to pro- 
cessors, the allocation could be varied at run-time, so that rays are moved from 
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(a) 2% = Mq, 98% = 0.001000 M , total (b) 2% = M Q , 98% = 0.000675 M Q , total 
number of stars = 29887581 number of stars = 4920276 




(c) 2% = Mq, 98% = 0.000025 M , total (d) 2% = M Q , 98% = Smooth, total number 
number of stars = 1195479475 of stars = 609 



Figure 4: Magnification maps from simulations of bi-modal mass distributions; these use the 
same parameters as figuref2]but use smaller and more numerous masses as the compact matter 
(<r s ). As the size of the masses is decreased the similarities between the use of compact masses 
(a)-(c) and smooth matter (d) become more marked. 
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heavily loaded processors to lightly loaded processors. We could implement mi- 
crotasking where each ray could be a task, with a scheduler pulling jobs off a 
run-queue and assigning them to processors for execution. We have not inves- 
tigated whether the load on processors varies enough to warrant the overhead 
of dynamic load-balancing; our impressions are that it does not, for very large 
runs. 

Distributed memory. A simulation of billions of stars generates hundreds 
of gigabytes of data, and this data has to go somewhere. If there are enough pro- 
cesses available with enough total memory, the data can be distributed amongst 
them and fetched across a network. 

Pre-set stars. It would useful to generate stars and cells that can be saved 
and re-used, thus skipping the initialization phase altogether. A standard for 
the storing of lens/cell data could be developed by the lensing community, so 
that stars and cells can be stored and then simulated, modified, and re-used as 
necessary. 

Multithreaded version. The new implementation has been built with this 
in mind. There is little data specific to a process that would have to be made 
specific to a thread; the significant shared data is external and can be accessed 
just the same by multiple threads or processes. Therefore a multithreaded 
version could be generated. 

Use a commercial database for cells and stars. The assumption is 
that a commercial database will be more efficient at general file caching and 
perhaps can be tuned to the usage patterns of our simulations. 

File compression. Compressing the stars and cell files on the fly will not 
make a simulation run faster, in fact it will run slower, but the data compresses 
to about 63% of its original size. This would be useful if we run short of 
disk space or go to bigger simulations, but particularly it could be useful for a 
distributed-memory version. 

6. 2. Science 

Beyond computer engineering, there are the scientific aspects of simulating 
microlensing on computers, for example with the use of a cell tree. The cell tree 
is one reason why the original method executed so speedily, but it is also the 
reason for the huge amount of data generated. Instead of using a tree it may 
be possible to use other mo re recent mode ls such as adaptive meshes that are 
used in N-body simulations (jYahagil . I2005T ). 

There are also enhancements that can be made that add more physics to the 
implementation, such as implementing moving sources and moving stars. It is 
clear that both the source and the lens and the objects that make up the lens, 
if it is complex, are moving; and the gravitational potential and light curves are 
changing over time. The ability to simulate time-changing lensing events would 
be a significant advance over all current approaches. 
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7. Conclusion 



We have generated a new version of a ray-shooting microlensing simulation 
tool that can be used to run huge simulations and continue research in quasar 
dark matter microlensing. We have discovered that the ray-shooting is easily 
parallelizable and simple load-balancing works well; parallelizing the stars and 
cell generation is not so easy, and memory caching could be improved with more 
research into the patterns of how the data is accessed. 

Our new approach will now be used to continue research in microlensing; 
based on its performance and usefulness we can determine what modifications 
to pursue for the future. 
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